Mixtape: Breaking the Softmax Bottleneck Efficiently
Yang, Zhilin, Luong, Thang, Salakhutdinov, Russ R., Le, Quoc V.
The softmax bottleneck has been shown to limit the expressiveness of neural language models. Mixture of Softmaxes (MoS) is an effective approach to address this theoretical limitation, but it is expensive compared to softmax in terms of both memory and time. We propose Mixtape, an output layer that breaks the softmax bottleneck more efficiently with three novel techniques: logit space vector gating, sigmoid tree decomposition, and gate sharing. On four benchmarks covering language modeling and machine translation, the Mixtape layer substantially improves efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance. A network equipped with Mixtape is only 20% to 34% slower than a softmax-based network with vocabulary sizes of 10K to 30K, and it outperforms softmax in perplexity and translation quality.
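To make the abstract's three techniques concrete, here is a minimal NumPy sketch of a Mixtape-style output layer for a single context vector, based only on the high-level description above: per-token gates are mixed in logit space (rather than mixing K full softmax distributions as in MoS), and the gate priors come from a sigmoid tree decomposition with K-1 sigmoids for K components. The parameter names, shapes, and gate parameterization are hypothetical simplifications, and gate sharing for infrequent tokens is omitted; this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def mixtape_output(g, H, W, U, b):
    """Toy Mixtape-style output layer for one context vector.

    g : (d,)         context vector from the network
    H : (K, d1, d)   component projections; h_k = tanh(H_k @ g)
    W : (V, d1)      output token embeddings
    U : (V, K-1, d)  per-token gate parameters (hypothetical parameterization)
    b : (V, K-1)     per-token gate biases
    Returns a (V,) probability distribution from a single softmax.
    """
    K = H.shape[0]
    assert K == 4, "this sketch hard-codes a K = 4 sigmoid tree"

    # 1) K component context embeddings (used for logit-space mixing,
    #    instead of computing K separate softmaxes as MoS does)
    h = np.tanh(H @ g)                                    # (K, d1)

    # 2) K-1 sigmoid gates per token, combined by the sigmoid tree
    #    decomposition into K gate priors that sum to 1 per token
    s = sigmoid(np.einsum('vnd,d->vn', U, g) + b)         # (V, K-1)
    pi = np.stack([
        s[:, 0] * s[:, 1],                 # component 1
        s[:, 0] * (1 - s[:, 1]),           # component 2
        (1 - s[:, 0]) * s[:, 2],           # component 3
        (1 - s[:, 0]) * (1 - s[:, 2]),     # component 4
    ], axis=1)                                            # (V, K)

    # 3) Vector gating in logit space: mix per-component logits per token,
    #    then apply a single softmax over the vocabulary
    component_logits = W @ h.T                            # (V, K)
    logits = (pi * component_logits).sum(axis=1)          # (V,)
    return softmax(logits)

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
d, d1, V, K = 8, 8, 20, 4
probs = mixtape_output(rng.standard_normal(d),
                       rng.standard_normal((K, d1, d)),
                       rng.standard_normal((V, d1)),
                       0.1 * rng.standard_normal((V, K - 1, d)),
                       np.zeros((V, K - 1)))
print(probs.shape, probs.sum())  # (20,) 1.0
```

Because the mixing happens on scalar logits rather than on K vocabulary-sized probability vectors, only one softmax normalization is needed, which is where the claimed memory and time savings over MoS come from.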
Reviews: Mixtape: Breaking the Softmax Bottleneck Efficiently
POST-AUTHOR FEEDBACK: I thank the authors for their feedback and clarifications. I have increased my score based on those answers and on the expectation that the promised modifications will appear in the final version. I would strongly encourage the authors to make the released code as easy to use as possible, ideally with plugins for major platforms. This would not only increase citations but also have a direct impact on a number of use cases.
ORIGINAL REVIEW: This paper addresses the softmax bottleneck problem: resolving it has been shown to significantly improve results when the output is over a large space (e.g., in NLP). However, current solutions are very costly.
This paper proposes techniques to deal with the softmax bottleneck problem.
Pros:
• Experimental results show strong performance in language modeling and machine translation.
Cons:
• The writing of the paper could be improved by making it more self-contained.
The paper represents solid work. There are clarity issues pointed out by the reviewers.